We have now preprocessed and merged our single cell data. The next step is to analyse the data for cell type identification, identification of marker genes and et cetera. We will focus on some simple steps of downstream analysis for single cell RNA-sequencing (scRNA-seq) data as shown below.
# This section is to install package that will be required to proceed with this section of the workshop.
# If you do not have "devtools" installed, uncomment the following line and run on your console
# install.packages("devtools")
# devtools::install_github("SydneyBioX/scdney")
We will be using some of the functions we developed in our scdney package. You may visit our package website for the vignette and further details about scdney.
# We will subset Su et al. and Yang et al. datasets.
ids = which(colData(sce_scMerge)$batch %in% c("GSE87795", "GSE90047"))
lab = colData(sce_scMerge)$cellTypes[ids]
nCs = length(table(lab))
mat = SummarizedExperiment::assay(sce_scMerge, "scMerge")[,ids]
The data from single cell RNA-sequencing experiment does not provide cell type information for individual cells. We need to identify cell types of indivdual cells in our data with bioinformatics analysis.
Before we identify the cell types in the dataset, we need to first identify how many distinct group of population we can find from our data. One common method to achieve this is by clustering. Clustering is a type of machine learning technique to group similar samples (cell) to the same group and partition samples that are different by comparing their feature information (gene expression).
To optimise clustering method, algorithm and similarity metric are two key components that affects clustering performance.
In our recent study, we found pearson correlation to be optimal similarity metric for comparing single cell RNA-seq data ( Kim et al., 2018 ). Therefore in this workshop, we will utilise scClust in our scdney package, which implemented 2017 Nature methods clustering algorithm SIMLR with pearson correlation.
Typical clustering methods (except for some methods like hierarchical clustering) require users to specify number of distinct groups (k) to cluster from your data. In an unsupervised setting, we do not know the exact number and therefore we will run clustering for various number of k.
# We will not run for various `k` to save time. Instead, we will load pre-computed results for `k` between 3 to 8
# This is an easy way to run `scClust` for k = (3,4,5,6,7,8).
# all_k = 3:8
# simlr_results = sapply(as.character(all_k), function(k) {
# scClust(mat, as.numeric(k), similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
# }, USE.NAMES = TRUE, simplify = FALSE)
# For demonstration purpose, we will run k = 6 (which is actually the number of cell types in our dataset)
simlr_result_k6 = scClust(mat, 6, similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 6
## Iteration: 7
## Iteration: 8
## Iteration: 9
## Iteration: 10
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 0.1527458
## Epoch: Iteration # 200 error is: 0.1218641
## Epoch: Iteration # 300 error is: 0.1112158
## Epoch: Iteration # 400 error is: 0.1059176
## Epoch: Iteration # 500 error is: 0.1026344
## Epoch: Iteration # 600 error is: 0.1003277
## Epoch: Iteration # 700 error is: 0.0985936
## Epoch: Iteration # 800 error is: 0.09722343
## Epoch: Iteration # 900 error is: 0.09610318
## Epoch: Iteration # 1000 error is: 0.09516611
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 11.56649
## Epoch: Iteration # 200 error is: 0.2845774
## Epoch: Iteration # 300 error is: 0.2560054
## Epoch: Iteration # 400 error is: 0.2410104
## Epoch: Iteration # 500 error is: 0.2361371
## Epoch: Iteration # 600 error is: 0.2331395
## Epoch: Iteration # 700 error is: 0.231009
## Epoch: Iteration # 800 error is: 0.229431
## Epoch: Iteration # 900 error is: 0.2281798
## Epoch: Iteration # 1000 error is: 0.2271669
# load("~/Dropbox (Sydney Uni)/workdir/singleCell/workshop/data/simlr_result_k6.RData")
load(paste(getwd(), "/data/simlr.results.RData", sep = ""))
k?If our k is correct, we expect that the each clusters are closely packed together and the distance between the clusters, within-cluster sum of squares (WSS) are expected to be small. Thus, we will select the k with a small total WSS (compact clusters). This is called the “Elbow” method.
# Find total WSS from all cluster outputs
all_wss = sapply(simlr_results, function(result) {
sum(result$y$withinss)
}, USE.NAMES = TRUE, simplify = TRUE)
plt.dat = data.frame(
k = names(all_wss),
total_wss = all_wss
)
ggplot(plt.dat, aes(x = as.numeric(as.character(k)), y = total_wss)) +
geom_point(size = 3) +
stat_smooth(method = loess, col = "red", method.args = list(degree = 1)) +
ylab("Total WSS") +
ggtitle("Compare Total WSS for each 'k'")
As shown in this plot, the graph begin to plateau from k = 5. We can estimate that k is around 5 or 6. We can further investigate using silhouette scores or other metrics but using our t-SNE plot (and with a little cheating) we will estimate k = 6.
k with compactness of the clusters. What other measure can we use to determine k?For the purpose of demonstration, we would like to highlight the effect of similarity metric to your cluster output.
# To run scClust with euclidean distance, uncommnet the following lines.
# simlr_result_eucl_k6 = scClust(mat, 6, similarity = "euclidean", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
# for convenience, we will load our pre-computed result
load(paste(getwd(), "/data/simlr_result_eucl_k6.RData", sep = ""))
# create tsne object
set.seed(123)
tsne_result = Rtsne(t(mat), check_duplicates = FALSE)
#################################################
tmp_lab = as.numeric(factor(lab))
pear_cluster = mapvalues(
simlr_result_k6$y$cluster,
from = c(1,2,3,4,5,6),
to = c(2,3,1,4,6,5)
)
eucl_cluster = mapvalues(
simlr_result_eucl_k6$y$cluster,
from = c(1,2,3,4,5,6),
to = c(6,2,4,3,1,5)
)
#################################################
plt.dat = data.frame(
tsne1 = rep(tsne_result$Y[,1], 3),
tsne2 = rep(tsne_result$Y[,2], 3),
cluster = factor(c(tmp_lab, pear_cluster, eucl_cluster)),
label = factor(c(rep("Truth", length(lab)), rep("Pearson", length(lab)), rep("Euclidean", length(lab))), levels = c("Truth", "Pearson", "Euclidean"))
)
ggplot(plt.dat, aes(x = tsne1, y = tsne2, colour = cluster, colors)) +
geom_point(size = 2) +
ggtitle("t-SNE plot") +
facet_grid(cols=vars(label)) +
theme(legend.position = "none")
[Optional] We can evaluate the clustering performance using our ground truth.
# ARI
ari = c(mclust::adjustedRandIndex(lab, simlr_result_eucl_k6$y$cluster), mclust::adjustedRandIndex(lab, simlr_result_k6$y$cluster))
# NMI
nmi = c(igraph::compare(as.numeric(factor(lab)), simlr_result_eucl_k6$y$cluster, method = "nmi"), igraph::compare(as.numeric(factor(lab)), simlr_result_k6$y$cluster, method = "nmi"))
plt.dat = data.frame(
dist = rep(c("Euclidean", "Pearson"), 2),
value = c(ari, nmi),
eval = rep(c("ARI", "NMI"), each = 2)
)
ggplot(plt.dat, aes(x = dist, y = value, fill = dist)) +
geom_bar(stat="identity") +
facet_grid(col = vars(eval)) +
labs(x = "Similarity metrics", y = "Evalution score", title = "Affect of similarity metrics in scRNA-seq data")
We have grouped all the cells to 6 distinct groups. We now want to identify what these groups are, i.e. what cell types are there in my dataset? Thus, what defines a cell type?
We can use marker genes to identify cell types.
Here we provide a function that allow one to find differentially expressed gene between a cluster and the remaining clusters. The input of this function is the cluster id. The output is a list of gene and its associated p - value.
# This function is specific for scClust ouptut object that ran with "SIMLR"
findmarker <- function(mat, scClust_obj, cluster_id){
#group 1 is the cluster of interest
group1 = which(scClust_obj$y$cluster == cluster_id)
group1 = colnames(mat)[group1] #find out which cell belongs to group 1
group_label <- ifelse(colnames(mat) %in% group1, "group1", "group2") #if the cell is in group 1 ,create a label called group 1, else create a label called group 2
#create a dataframe , with cell name and the group that this cell belongs to
group_label <- data.frame(group_label, colnames(mat))
rownames(group_label) <- group_label[,2]
colnames(group_label) <- c("group","wellKey")
#MAST requires a wellkey (which essentially is the name of the cell)
#create a dataframe, with gene name
gene_label <- data.frame(rownames( mat))
colnames(gene_label) <- "primerid"
rownames(gene_label) <- gene_label[, 1]
#MAST requires a primerid (which essentially is the gene name)
#create a MAST object, with the group id of each cell and the gene name
sca <- MAST::FromMatrix(
exprsArray = mat,
cData = group_label,
fData = gene_label
)
#this is the differential expression model (called a hurdle model)
#(see https://www.bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAITAnalysis.html#4_differential_expression_using_a_hurdle_model)
zlmCond <- MAST::zlm(~group, sca)
summaryCond <- summary(object = zlmCond, doLRT = 'groupgroup2')
#now extract the information
summaryDT <- data.frame(summaryCond$datatable)
return_val <-data.frame(summaryDT[summaryDT[, "component"] == "H", 1], summaryDT[summaryDT[, "component"] == "H", 4])
#We select the "H" rows, the "H" means we want hurdle P values
#extract both the gene name and the P value associated with this gene
colnames(return_val) <- c("gene", "P_value")
#order by the p value , from the most significant
return_val <- return_val[order(return_val$P_value), ]
return (return_val)
}
Here we provide an example of using the findmarker gene function.
To find out the marker gene in cluster 4, we type in 4 in the findmarker function. We then look at the top 10 genes ranked by p-value. We can then use ggplot to visualise the distribution of one of the genes across the dataset.
df<- data.frame(tsne_result$Y)
df<- cbind(df, as.factor(simlr_result_k6$y$cluster) )
df = df %>%
dplyr::rename(
tsne1 = X1,
tsne2 = X2,
cluster = `as.factor(simlr_result_k6$y$cluster)`
)
marker <- findmarker(mat, simlr_result_k6, 4)
## Assuming data assay in position 1, with name et is log-transformed.
##
## Done!
## Combining coefficients and standard errors
## Calculating log-fold changes
## Calculating likelihood ratio tests
## Refitting on reduced model...
##
## Done!
marker[1:10,]
## gene P_value
## 12739 Gys2 6.068095e-153
## 19151 Shbg 2.513240e-134
## 12924 Hgd 1.577501e-126
## 19364 Slc27a2 8.229521e-124
## 6096 Gcgr 7.663681e-122
## 16209 Otc 1.338106e-115
## 12842 Hc 1.016114e-113
## 15057 Mogat2 1.509759e-113
## 12729 Gulo 6.801648e-113
## 14081 Lect2 1.976735e-112
index <- which(rownames(sce_scMerge[,ids]) == "Gys2")
# df_Gys2 <- cbind(df, mat[index,] )
df_Gys2 = df
df_Gys2$Gys2 = mat[index,]
ggplot(data =df_Gys2, mapping = aes(x=tsne1, y = tsne2, colour = Gys2) ) +
geom_point(alpha=0.5) +
scale_color_viridis() +
theme_bw() +
labs(col="Gys2 expression", x = "tsne1", y = "tsne2")
Here we repeat the analysis as above, but for cluster 3. See if you can understand the output.
marker <- findmarker(mat, simlr_result_k6, 3)
## Assuming data assay in position 1, with name et is log-transformed.
##
## Done!
## Combining coefficients and standard errors
## Calculating log-fold changes
## Calculating likelihood ratio tests
## Refitting on reduced model...
##
## Done!
marker[1:10,]
## gene P_value
## 88 1700011H14Rik 1.187064e-210
## 5220 Erich5 8.248276e-200
## 1257 Adamts16 4.544699e-193
## 3539 Cldn3 2.663833e-177
## 20941 Tmem27 2.406824e-172
## 11554 Gm609 7.277614e-164
## 14560 Marveld3 3.901531e-152
## 21906 Vnn1 5.224090e-147
## 13356 Igfbp7 5.040702e-134
## 19960 Spp1 8.186524e-133
index <- which(rownames(sce_scMerge[,ids]) == "Erich5")
# df_erich<- cbind(df, mat[index,] )
df_erich <- df
df_erich$erich = mat[index,]
ggplot(data =df_erich, mapping = aes(x=tsne1, y = tsne2, colour = erich) ) +
geom_point(alpha=0.5)+scale_color_viridis()+
theme_bw()+
labs(col="Erich5 expression")
In Q2, we have identified some interesting marker genes from our dataset. If we have a gene that we know, and we want to identify the its expression pattern in our dataset, we can also visualise the distribution. For example, Hnf4a has been stated in literature as a marker gene for hepatoblast cell.
The figure on the left highlight cluster 4 and the figure on the right highlight the expression of Hnf4a.
This suggests that cluster 4 could belong to hepatoblast cell.
fig1 <- ggplot(data = df, mapping = aes(x=tsne1, y = tsne2) ) +
geom_point(aes(color = ifelse(cluster == 4, 'Yellow', 'Purple') ) ,alpha=0.4) + scale_colour_viridis_d() +
ggtitle("Cluster 4") +
xlab("") +
ylab("") +
labs(colour = "") +
theme(legend.title=element_blank()) +
theme_bw()+ guides(color=FALSE)
index <- which(rownames(sce_scMerge[,ids]) == "Hnf4a")
df_hnf4a <- df
df_hnf4a$hnf4a = mat[index,]
fig2 <- ggplot(data =df_hnf4a, mapping = aes(x=tsne1, y = tsne2, colour = hnf4a) ) +
geom_point(alpha=0.5) +
scale_color_viridis() +
theme_bw()+labs(col="expression") +
ggtitle("Hnf4a expression pattern")
ggarrange(fig1,fig2, ncol= 2, nrow = 1)
See if you can understand the output of the following code.
fig1<- ggplot(data =df, mapping = aes(x=tsne1, y = tsne2) ) +
geom_point(aes(color = ifelse(cluster == 3, 'Yellow', 'Purple') ) ,alpha=0.4) +
scale_colour_viridis_d() +
ggtitle("Cluster 3") +
labs(colour = "")+
theme(legend.title=element_blank())+
theme_bw()+
guides(color=FALSE)
index <- which(rownames(sce_scMerge[,ids]) == "Epcam")
df_epcam<- cbind(df, mat[index,] )
fig2 <- ggplot(data =df_epcam, mapping = aes(x=tsne1, y = tsne2, colour = `mat[index, ]`) ) +
geom_point(alpha=0.5)+
scale_color_viridis()+
theme_bw()+
labs(col="expression")+
ggtitle("Epcam expression pattern")
ggarrange(fig1,fig2, ncol= 2, nrow = 1)
As you may have already noticed from above steps, unsupervised clustering methods alone are not perfect in capturing cell type information. Using this step iteratively, we need to refine our cell type information.
NOTE: From this step, we will assume we have correctly refined our cell type information from above steps and we will use our “known” cell type information.
plt.dat = data.frame(
table(lab)
)
ggplot(plt.dat, aes(x = lab, y = Freq, fill = lab)) +
geom_bar(stat = "identity") +
labs(x = "Cell types", y = "Frequency", title = "Composition of cell types")
We observe that hepatoblast/hepatocyte is the largest population.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] plyr_1.8.4 ggpubr_0.2
## [3] magrittr_1.5 viridis_0.5.1
## [5] viridisLite_0.3.0 MAST_1.10.0
## [7] ggplot2_3.1.1 cluster_2.0.9
## [9] Rtsne_0.15 mclust_5.4.3
## [11] scdney_0.1.2 edgeR_3.26.4
## [13] limma_3.40.2 dplyr_0.8.1
## [15] SingleCellExperiment_1.6.0 SummarizedExperiment_1.14.0
## [17] DelayedArray_0.10.0 BiocParallel_1.18.0
## [19] matrixStats_0.54.0 Biobase_2.44.0
## [21] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
## [23] IRanges_2.18.1 S4Vectors_0.22.0
## [25] BiocGenerics_0.30.0
##
## loaded via a namespace (and not attached):
## [1] amap_0.8-17 minqa_1.2.4 colorspace_1.4-1
## [4] class_7.3-15 ggridges_0.5.1 htmlTable_1.13.1
## [7] XVector_0.24.0 base64enc_0.1-3 rstudioapi_0.10
## [10] mice_3.5.0 prodlim_2018.04.18 manipulate_1.0.1
## [13] mvtnorm_1.0-10 lubridate_1.7.4 codetools_0.2-16
## [16] splines_3.6.0 doParallel_1.0.14 knitr_1.23
## [19] Formula_1.2-3 nloptr_1.2.1 caret_6.0-84
## [22] clusteval_0.1 broom_0.5.2 compiler_3.6.0
## [25] backports_1.1.4 assertthat_0.2.1 Matrix_1.2-17
## [28] lazyeval_0.2.2 acepack_1.4.1 htmltools_0.3.6
## [31] tools_3.6.0 igraph_1.2.4.1 gtable_0.3.0
## [34] glue_1.3.1 GenomeInfoDbData_1.2.1 reshape2_1.4.3
## [37] Rcpp_1.0.1 nlme_3.1-140 iterators_1.0.10
## [40] timeDate_3043.102 xfun_0.7 gower_0.2.1
## [43] stringr_1.4.0 lme4_1.1-21 dendextend_1.12.0
## [46] pan_1.6 zlibbioc_1.30.0 MASS_7.3-51.4
## [49] scales_1.0.0 ipred_0.9-9 doSNOW_1.0.16
## [52] expm_0.999-4 RColorBrewer_1.1-2 yaml_2.2.0
## [55] gridExtra_2.3 rpart_4.1-15 latticeExtra_0.6-28
## [58] stringi_1.4.3 randomForest_4.6-14 foreach_1.4.4
## [61] blme_1.0-4 e1071_1.7-2 checkmate_1.9.3
## [64] boot_1.3-22 lava_1.6.5 rlang_0.3.4
## [67] pkgconfig_2.0.2 bitops_1.0-6 evaluate_0.14
## [70] lattice_0.20-38 purrr_0.3.2 labeling_0.3
## [73] recipes_0.1.5 htmlwidgets_1.3 cowplot_0.9.4
## [76] tidyselect_0.2.5 R6_2.4.0 snow_0.4-3
## [79] DescTools_0.99.28 generics_0.0.2 Hmisc_4.2-0
## [82] mitml_0.3-7 pillar_1.4.1 foreign_0.8-71
## [85] withr_2.1.2 abind_1.4-5 survival_2.44-1.1
## [88] RCurl_1.95-4.12 nnet_7.3-12 tibble_2.1.2
## [91] crayon_1.3.4 jomo_2.6-8 rmarkdown_1.13
## [94] locfit_1.5-9.1 grid_3.6.0 data.table_1.12.2
## [97] ModelMetrics_1.2.2 digest_0.6.19 tidyr_0.8.3
## [100] munsell_0.5.0